Chapter 1 - Introduction to IODS course project

Brief introduction

Hello, I am Juho Mononen, a PhD student at UEF who has some experience with GitHub and is quite familiar with R and RStudio. I am currently dabbling in bioinformatics with a main focus on identifying regulatory sites affected by genetic variation. I am mostly self-taught in programming, so I am participating to unlearn potentially bad habits and to learn better practices in project management and statistics.

A friend introduced me to this course.

Things I expect to learn

  • Basics of version control and Github usage with R
  • Good practices with R and Rstudio

Chapter 2 - Regression and model validation

General features of the dataset

ASSIST dataset

We are going to analyse a subset of a data set from the ASSIST (Approaches and Study Skills Inventory for Students) study, which in its raw format can be found here. The study evaluated different learning approaches in students. The learning approaches covered were DEEP learning, STRAtegic learning and SURFace learning (the capitalized parts of the learning approaches correspond to the variable names in the dataset, which is loaded below as students2014).

# Read file that was wrangled
students2014 <- read.delim("Data/learning2014.tsv", header = T, sep = "\t")

# Structure and dimensions
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  39 33 33 37 35 38 37 37 34 29 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# ...and dimensions
dim(students2014)
## [1] 166   7

As can be seen from the str() and dim() calls, our subset consists of 7 variables with 166 observations each. Three of these variables are numeric (deep, stra and surf), one is character/categorical (gender) and three are integers (age, attitude and points).

Loading packages for easier data wrangling and plotting

For the following steps, which include plotting, we are going to use functions from the following packages:

library(GGally)
library(ggpubr)
library(tidyverse)
library(broom)

Getting to know the data

Now that we have our packages set up, we can proceed to take a closer look at the variables. Let's start with summary(), which gives us some basic statistics for the variables:

summary(students2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :13.00   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:30.25   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :34.00   Median :3.667  
##                     Mean   :25.51   Mean   :33.81   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:37.75   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Although this function provides a lot of information, it is not that intuitive for assessing larger differences. We can tell that the 166 males and females are between ages 17 and 55 with a mean age of 25.51, and that the variables age, attitude and points are on average larger than deep, stra and surf. Let's make some plots to look into the data further. First, let's make some adjustments to students2014 to make it easier to plot with e.g. ggplot().

## Lets gather our multi-column data to three columns to make plotting and data editing with grouping easier
to.ggplot <- students2014 %>% gather(key="Variable", value = "Value", -gender) 
## Check that everything is in order
str(to.ggplot)
## 'data.frame':    996 obs. of  3 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Variable: chr  "age" "age" "age" "age" ...
##  $ Value   : num  53 55 49 53 49 38 50 37 37 42 ...

Now, if we want to perform statistical testing on the variables, we should figure out which of them are normally and which are non-normally distributed. We can do this by performing the Shapiro-Wilk test with the commands described below. The Shapiro-Wilk test and normality testing in R are described here in more detail.

## Lets test for normality so we can determine the best tests for statistical difference
normality.res <- to.ggplot %>%
  ## perform by groups
  group_by(Variable) %>% 
  ## create data.frame()-ception
  nest() %>% 
  ## perform shapiro-wilk to the nested DFs
  mutate(Shapiro.test = map(data, ~ shapiro.test(.x$Value))) %>% 
  ## use broom::glance to generate shapiro-wilk summary
  mutate(Shapiro.sum = Shapiro.test %>% map(glance)) %>%
  ## unlist the summary to make it more accessible
  unnest(Shapiro.sum)

## Save normally distributed and non-normally distributed variables into variables
normal.values <- normality.res$Variable[which(normality.res$p.value>0.05)]
non.normal <- normality.res$Variable[which(normality.res$p.value<=0.05)]

## Print the results
normal.values
## [1] "stra" "surf"
non.normal
## [1] "age"      "attitude" "deep"     "points"

It would seem that the variables stra and surf are normally distributed (Shapiro-Wilk P-value > 0.05) while age, attitude, deep and points are not (Shapiro-Wilk P-value \(\leq\) 0.05).

Generating some basic plots for data exploration

Since we don't really have a grasp of the ratio of females to males in our data, we can generate a plot for that information. We could also test whether the distributions of certain variables differ between females and males. We can draw box plots of the distributions and perform an independent t-test or an unpaired two-samples Wilcoxon test, depending on the variable normality that we tested before.

## generate a simple bar plot for counts of male and female subjects in the data set
p1 <- ggplot(dplyr::select(students2014, gender), aes(x=gender, fill=gender)) +
  geom_bar(colour="black") + 
  ## expand the y-axis so that the count labels don't cut off
  scale_y_continuous(expand = expansion(mult = c(0, .2))) + 
  geom_text(stat='count', aes(label=..count..), vjust=-1) +
  scale_fill_manual(values = c("red", "blue")) +
  theme_bw(12) + 
  ## remove unnecessary x axis title
  theme(axis.title.x = element_blank())

## Plot normally distributed variables as box plots and perform t-test using stat_compare_means
p2.n <- ggplot(dplyr::filter(to.ggplot, Variable %in% normal.values), 
               aes(x = gender, y = Value, fill = gender, group=gender)) +
  geom_boxplot() + 
  ## expand the y-axis so that the P-value labels don't cut off
  scale_y_continuous(expand = expansion(mult = c(0, .2))) +
  scale_fill_manual(values = c("red", "blue")) + 
  ## Separate variables into different facets
  facet_wrap(~Variable, scales="free", nrow=1) + 
  ## perform T-test
  stat_compare_means(comparisons = list(c("F","M")), method = "t.test", paired = FALSE) +
  theme_bw(12) + 
  ## remove unnecessary legend as it is in p2.nn
  ## remove unnecessary x axis title
  theme(legend.position = "none",
        axis.title.x = element_blank())

## Plot non-normally distributed variables as box plots and perform Wilcoxon test using stat_compare_means
p2.nn <- ggplot(dplyr::filter(to.ggplot, Variable %in% non.normal), 
                aes(x = gender, y = Value, fill = gender, group=gender)) +
  geom_boxplot() + 
  ## expand the y-axis so that the P-value labels don't cut off
  scale_y_continuous(expand = expansion(mult = c(0, .2))) +
  scale_fill_manual(values = c("red", "blue")) +
  ## Separate variables into different facets
  facet_wrap(~Variable, scales="free", nrow=1) +
   ## perform Wilcoxon test
  stat_compare_means(comparisons = list(c("F","M")), method = "wilcox.test", paired = FALSE) +
  theme_bw(12) + 
  ## remove unnecessary y-axis title as it is in p2.n
  ## remove unnecessary x axis title
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank()) 
## Combine plots with ggpubr's ggarrange() for labeling
ggarrange(p1, p2.n, p2.nn, labels = c("A", "B"), nrow=1, widths = c(1,1,2.3))
**Figure 1:** Basic plots for data exploration. **A)** Counts of data set subjects by gender. **B)** Box plots of the different variables.


There are almost twice as many females in the data set as there are males (Figure 1A). Age seems to be the only variable with a different distribution between males and females, with males being older (Wilcoxon P-value < 0.05) (Figure 1B).

Performing regression analysis for points

Now that we are familiar with the data set, we can move on to regression modelling. We are going to use multiple regression to test how well exam points can be predicted using 3 variables from the data set (more information on multiple regression here). We can select e.g. the variables with the highest correlation with points for the fit. While we are at it, we can plot all of the inter-variable correlations in our data set using GGally::ggcorr().

## remove gender as it is not numeric and gives error
ggcorr(students2014[colnames(students2014) != "gender"], palette = "RdBu", label = TRUE, label_round=3) 
**Figure 2:** Correlation plot for the different variables


Based on the results of the correlation analysis (Figure 2), we could use the variables stra, surf and attitude as our explanatory variables for points. Let's build the model using lm() and view the summary of the results.

lm.data <- lm(points ~ attitude + stra + surf, data = students2014)
summary(lm.data)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7071  -3.3760   0.4279   3.6677  10.0219 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.87300    3.98295   3.232  0.00149 ** 
## attitude     0.33976    0.07784   4.365 2.26e-05 ***
## stra         0.69501    0.56790   1.224  0.22280    
## surf        -1.36841    0.82397  -1.661  0.09870 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 162 degrees of freedom
## Multiple R-squared:  0.1378, Adjusted R-squared:  0.1218 
## F-statistic: 8.627 on 3 and 162 DF,  p-value: 2.399e-05

Out of the variables used for fitting, only attitude seems to be a good addition to the model, with a P-value of 2.26e-05. For every “global attitude toward statistics” point, exam points rise by ~0.34. The rest of the variables used in the model did not reach significance (P-value \(\geq\) 0.05), although with a looser cut-off (e.g. P-value < 0.1) surf, aka “surface approach”, seems to have a negative impact on the test score.

The model, however, is quite unimpressive. Even though it is statistically significant (the independent variables explain the dependent variable, P-value = 2.399e-05), it does not explain much of the variability. The \(R^2\) value can be used to assess how well the model explains the variability in the data. \(R^2\) is calculated in the following manner:

\[ R^2 = \frac{\text{Explained variation of the model}}{\text{Total variation of the model}} \]

The higher the \(R^2\) (closer to 1), the better. The multiple \(R^2\) of the model is a measly 0.1378, which means that the current model fails to predict most of the variability (it explains only ~14% of it). In summary, the model explains the dependent variable, but only to a small extent.
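
As a sanity check, \(R^2\) can also be computed by hand from the residuals. The sketch below does this for a model fitted on R's built-in mtcars data (used purely for illustration, not the course data set):

```r
## Illustration only: recompute multiple R^2 by hand on the built-in
## mtcars data (not the course data set)
fit <- lm(mpg ~ wt + hp, data = mtcars)

ss_res <- sum(residuals(fit)^2)                   # unexplained variation
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation
r2_manual <- 1 - ss_res / ss_tot

## agrees with the value reported by summary(fit)
all.equal(r2_manual, summary(fit)$r.squared)
## [1] TRUE
```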

Assessing the quality of the model

We can now assess the model's validity by plotting some diagnostic plots.

par(mfrow = c(2,2))
plot(lm.data, which=c(1,2,5))
**Figure 3:** Plots for Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage


From the “Residuals vs Fitted” plot in Figure 3 we can see that there are no worrying trends, which suggests that we can assume a linear relationship between the dependent and independent variables. In the “Normal Q-Q” plot of Figure 3 the points follow the reference line quite nicely, so we can assume normality. In the “Residuals vs Leverage” plot of Figure 3 we can see that there are no influential points (observations that drastically affect the coefficients in the model summary). If there were influential points, we should pinpoint the cause behind the outlier and possibly even remove it (if it comes from an erroneous measurement etc.).
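
One way to quantify influence, rather than eyeballing the plot, is Cook's distance; a common rule of thumb flags observations with a distance above 4/n. A minimal sketch on the built-in mtcars data (illustrative only, not our model):

```r
## Illustration only: flag potentially influential observations with
## Cook's distance, using the built-in mtcars data
fit   <- lm(mpg ~ wt + hp, data = mtcars)
cooks <- cooks.distance(fit)

## rule-of-thumb cut-off: 4 / number of observations
influential <- which(cooks > 4 / nrow(mtcars))
names(influential)
```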

# Date and session info
date()
## [1] "Mon Dec 13 03:41:28 2021"
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Finnish_Finland.1252  LC_CTYPE=Finnish_Finland.1252   
## [3] LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C                    
## [5] LC_TIME=Finnish_Finland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] broom_0.7.10    forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
##  [5] purrr_0.3.4     readr_2.0.2     tidyr_1.1.4     tibble_3.1.5   
##  [9] tidyverse_1.3.1 ggpubr_0.4.0    GGally_2.1.2    ggplot2_3.3.5  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7         lubridate_1.8.0    assertthat_0.2.1   digest_0.6.28     
##  [5] utf8_1.2.2         R6_2.5.1           cellranger_1.1.0   plyr_1.8.6        
##  [9] backports_1.3.0    reprex_2.0.1       evaluate_0.14      httr_1.4.2        
## [13] highr_0.9          pillar_1.6.4       rlang_0.4.12       curl_4.3.2        
## [17] readxl_1.3.1       rstudioapi_0.13    data.table_1.14.2  car_3.0-11        
## [21] jquerylib_0.1.4    rmarkdown_2.11     labeling_0.4.2     foreign_0.8-81    
## [25] munsell_0.5.0      compiler_4.1.1     modelr_0.1.8       xfun_0.27         
## [29] pkgconfig_2.0.3    htmltools_0.5.2    tidyselect_1.1.1   rio_0.5.27        
## [33] reshape_0.8.8      fansi_0.5.0        tzdb_0.1.2         crayon_1.4.2      
## [37] dbplyr_2.1.1       withr_2.4.2        grid_4.1.1         jsonlite_1.7.2    
## [41] gtable_0.3.0       lifecycle_1.0.1    DBI_1.1.1          magrittr_2.0.1    
## [45] scales_1.1.1       zip_2.2.0          cli_3.1.0          stringi_1.7.5     
## [49] carData_3.0-4      farver_2.1.0       ggsignif_0.6.3     fs_1.5.0          
## [53] xml2_1.3.2         bslib_0.3.1        ellipsis_0.3.2     generics_0.1.1    
## [57] vctrs_0.3.8        cowplot_1.1.1      openxlsx_4.2.4     RColorBrewer_1.1-2
## [61] tools_4.1.1        glue_1.4.2         hms_1.1.1          abind_1.4-5       
## [65] fastmap_1.1.0      yaml_2.2.1         colorspace_2.0-2   rvest_1.0.2       
## [69] rstatix_0.7.0      knitr_1.36         haven_2.4.3        sass_0.4.0

Chapter 3 - Logistic regression

Loading packages for easier data wrangling and plotting

For the following steps, which include plotting, we are going to use functions from the following packages:

library(GGally) ## plotting
library(ggpubr) ## package which contains ggplot2 + useful tools 
library(tidyverse) ## wrangling
library(gridExtra) ## for plotting tables 

General features of the dataset

Dataset & variables

In this session we will be exploring the Student Performance Dataset, which is based on data collected from students of two Portuguese schools via school reports and questionnaires. The full data can be found here. Let's have a closer look inside the data:

# Read file that was wrangled + stringsAsFactors for glm in advance
alc <- read.delim("Data/alc.tsv", header = T, sep = "\t")

# Print colnames -> variables
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures.p"
## [16] "schoolsup"  "famsup"     "paid.p"     "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences.p"
## [31] "G1.p"       "G2.p"       "G3.p"       "failures.m" "paid.m"    
## [36] "absences.m" "G1.m"       "G2.m"       "G3.m"       "failures"  
## [41] "paid"       "absences"   "G1"         "G2"         "G3"        
## [46] "alc_use"    "high_use"

The data holds several levels of information on the students, for example school, age, municipality type (rural/urban) and information about the parents' education and job status. It also has information on extracurricular activities on several levels (e.g. free time after school and going out with friends) on a 1-5 scale. However, in this session we are particularly interested in the variables alc_use and its categorical version high_use, which describe the weekly average and the categorized normal/high alcohol usage, respectively.

Graphical exploration of the variables

Even though the dataset contains many interesting variables that could relate to high alcohol usage, let's select a few here and see how, by eye, they are distributed among the students.

## selected variables to be plotted
variables <- c("activities", "goout", "studytime", "absences")

## Generate a list of plots
p.list <- lapply(variables, function(x) {
  
  ## Create a data.frame for plotting with variable name as one columns 
  tmp <- alc[x] %>% dplyr::rename(Value=x) %>% mutate({{x}}:=x)

  ## if value is numeric we want to plot distributions as violin plots
  if (is.numeric(tmp$Value)) {
     ggplot(tmp, aes_string(y="Value", x=x)) +
      geom_violin(fill="dodgerblue4") +
      scale_y_continuous(expand = expansion(mult = c(0, .2))) + # remove spaces of axis/plot
      theme_classic(14) +
      theme(axis.text.x =element_blank())
  ## Else we want to plot bar plots for categorical variable occurrences
  } else {
    ggplot(tmp, aes_string(x="Value")) +
      geom_bar(position = "dodge", fill="dodgerblue4") +
      geom_text(stat='count', aes(label=..count..), vjust=-1) +
      scale_y_continuous(expand = expansion(mult = c(0, .2))) + 
      theme_classic(14) +
      xlab(x)
    }
})
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(x)` instead of `x` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## plot the list of plots
ggarrange(plotlist = p.list, nrow = 1, align = "hv")
**Figure 1.** Interesting variables for alcohol consumption exploration


activities, which stands for “extra-curricular activities”, seems to be fairly equally distributed between the no/yes categories among the students. goout (going out with friends, from 1 - very low to 5 - very high) seems to be quite evenly distributed around 3, meaning that most students feel they partake in social activities in a “normal manner”. studytime (weekly study time: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours), on the other hand, suggests that these students are not that hard-working in general. High numbers of absences (number of school absences) are rare.

Initial inspection of the relation of the variables and high-use

Let's have a look at how, by eye, these variables relate to alcohol usage and high usage. We'll also have a quick glance at average alcohol usage and the distribution of high-users among the students.

## plot bar plot of high vs non high use
p1 <- ggplot(alc, aes(x=high_use, fill=high_use)) +
  geom_bar(position="dodge") +
  scale_fill_brewer(palette = "Set1") +
  geom_text(stat='count', aes(label=..count..), vjust=-1) +
  scale_y_continuous(expand = expansion(mult = c(0, .2))) + 
  theme_bw(14)

## plot distribution of alcohol use and color high use
p2 <- ggplot(alc, aes(x=alc_use, fill=high_use)) +
  geom_histogram(binwidth = 0.5,boundary=0) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(expand = expansion(mult = c(0, 0))) + 
  scale_x_continuous(expand = expansion(mult = c(0, 0))) + 
  theme_bw(14)

# for the activities variable, plot proportions among alcohol usage
p3 <- ggplot(alc, aes(x=alc_use, fill=activities)) +
  geom_histogram(binwidth = 0.5, position="fill",boundary=0) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(expand = expansion(mult = c(0, 0))) + 
  scale_x_continuous(expand = expansion(mult = c(0, 0))) + 
  theme_bw(14)

## select numeric variables for plotting correlation plot. Lets use spearman for simplicity.
vars.to.cor <- c("goout", "studytime", "absences", "alc_use") 
p4 <- alc %>% 
  select(any_of(vars.to.cor)) %>% 
  ggcorr(palette = "RdBu", label = TRUE, label_round=3, method=c("everything", "spearman"))

## combine plots
ggarrange(p1,p2,p3,p4, nrow = 2, ncol = 2, labels = LETTERS[1:4])
**Figure 2.** **A)** Distribution of high versus normal use students. **B)** Distribution of alcohol usage among students. Coloring based on high versus normal use. **C)** Proportions of activity-partaking students at different alcohol usage levels. **D)** Correlation values for all numeric variables of interest and alcohol usage.


There are fewer high-use alcohol users (high-users) among the students than there are normal users (Figure 2A). Most of the students also have really low alcohol usage compared to the few high-users (Figure 2B). The one categorical variable that was selected (activities) can be explored as proportions across alcohol usage levels: we can observe a slightly higher ratio of extracurricular activities between high and low alcohol usage, but there are no major trends visible here (Figure 2C). Correlation analysis of the numeric variables hints that a higher number of school absences and going out with friends are connected with high-use, while time spent studying has a negative correlation (Figure 2D). However, it should be noted that all the correlations are fairly low.

Based on this exploration we could hypothesize the following:

1. Activities are not predictive of high-use.
2. Absences and going out predict a higher likelihood of high-use.
3. Time spent studying is predictive of lower use.

Performing logistic regression on the chosen variables

Let's now perform logistic regression to see if our predictions/hypotheses were correct.

## Fit a logistic regression model for high_use
m <- glm(high_use ~ goout + studytime + absences + activities, data = alc, family = "binomial")

## compute odds ratios (OR)
OR <- coef(m) %>% exp

## compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
## save the odds ratios with their confidence intervals
res <- as.data.frame(cbind(OR, CI)) %>% 
  rownames_to_column("Variable") %>% 
  dplyr::filter(Variable!="(Intercept)")

## Generate OR plot
p.or <- ggplot(res, aes(x = OR, y = Variable)) + 
  geom_point(colour="red", size=3) + 
  geom_segment(aes(x = `2.5 %`, xend= `97.5 %`, yend=Variable), colour="red", lwd=1) +
  geom_vline(xintercept = 1, linetype="dashed") +
  ylab(NULL) +
  theme_bw(14)

## generate table plot for model summary
p.tab <- summary(m)$coefficients %>% as.data.frame() %>%
  mutate_all(function(x) signif(x, digits = 3)) %>% 
  tableGrob()

## Combine plots
ggarrange(p.or, p.tab, nrow=1, labels = c("A","B"))
**Figure 3.** **A)** Odds ratios for the variables. Points represent the OR and segments the confidence intervals. **B)** Summary of the GLM variables.


It seems that we were somewhat correct in our assumptions. Going out with friends boasts a mighty odds ratio (OR) of 2.09 (Figure 3A) and is significantly associated with high-use based on the model summary (Figure 3B). Studying time has a significant negative (OR=0.57) association with high-use. For the other variables, things aren't so straightforward. Absences, while significantly associated with high-use, present only a minimal OR of 1.07. For our categorical variable activities, there is no significant difference between the students partaking in activities and those who are not with regard to high-use (Figure 3A-B). These results are quite similar to our earlier hypotheses, apart from the really low impact of absences.
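
The interpretation of an OR follows directly from the model being linear on the log-odds scale: increasing a predictor by one unit multiplies the odds by exp of its coefficient. A small sketch with made-up numbers (b0 and b1 below are illustrative, not our fitted coefficients):

```r
## Illustration only: b0 and b1 are made-up coefficients on the
## log-odds scale; exp(b1) is the odds ratio
b0 <- -3.5
b1 <- log(2.09)

odds <- function(x) exp(b0 + b1 * x)  # odds = p / (1 - p)

## moving e.g. goout up by one unit multiplies the odds by exp(b1)
odds(3) / odds(2)
## [1] 2.09
```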

Based on these results we don't need to eliminate any of the numerical variables from our model. However, as explained here, we should perform a likelihood ratio test for activities to decide whether we should keep it in the model or not.

test.mod1 <- glm(high_use ~ goout + studytime + absences + activities, data = alc, family = "binomial") # with activities
test.mod2 <- glm(high_use ~ goout + studytime + absences, data = alc, family = "binomial") # without activities

anova(test.mod1, test.mod2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + studytime + absences + activities
## Model 2: high_use ~ goout + studytime + absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       365     375.68                     
## 2       366     377.16 -1  -1.4798   0.2238

The test is non-significant (P=0.2238), so we can remove activities from our model.
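
The reported p-value can also be reproduced by hand: the LRT statistic is the deviance difference, compared against a chi-squared distribution with degrees of freedom equal to the number of dropped parameters (here 1):

```r
## The deviance difference (1.4798) on 1 degree of freedom gives the
## same p-value as the anova() call above
round(pchisq(1.4798, df = 1, lower.tail = FALSE), 4)
## [1] 0.2238
```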

Exploring the power of the model

Refitting and tabulations

Let's now re-fit the model and perform some tabulations and predictions to see how strong our model is. Let's start by tabulating the outcomes based on our model.

## Generate model with significant variables
m2 <- glm(high_use ~ goout + studytime + absences, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions. Unclass here enables us to keep 2x2 format with tableGrob
p.tab1 <- table(high_use = alc$high_use, prediction = alc$prediction) %>% unclass() %>% 
  tableGrob()

## plot prop table
p.tab2 <- table(high_use = alc$high_use, prediction = alc$prediction) %>% 
  prop.table() %>% 
  round(digits = 2) %>%
  addmargins() %>% 
  tableGrob()

## Combine plots
ggarrange(p.tab1, p.tab2, nrow=1, labels = c("A", "B"))
**Figure 4.** **A)** 2x2 tabulation. **B)** Proportion table.


It seems that our model has reasonably good predictability, although the false-positive and false-negative rates are somewhat high (0.32 and 0.21, respectively) (Figure 4A), which can also be seen from the proportion table (Figure 4B).
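
For reference, this is how such rates are read off a 2x2 table: the false-positive rate is the share of actual negatives predicted positive, and the false-negative rate the share of actual positives predicted negative. A sketch with made-up counts (not the counts of Figure 4A):

```r
## Illustration only: the counts below are made up, not those of Figure 4A
conf <- matrix(c(240, 20,
                  66, 44),
               nrow = 2, byrow = TRUE,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))

fp_rate <- conf["FALSE", "TRUE"]  / sum(conf["FALSE", ])  # FP / actual negatives
fn_rate <- conf["TRUE",  "FALSE"] / sum(conf["TRUE",  ])  # FN / actual positives
c(fp_rate, fn_rate)
```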

Testing by guessing

Let's test our model's predictiveness compared to simple guessing, using a function from the DataCamp exercise (loss_func) and some nonsensical probabilities. First, let's get the prediction error when we guess that everyone is a high-user, then that no one is a high-user, and compare these results to our model's prediction error.

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

## Let's make a guess with the probability of high use as 1
loss_func(class = alc$high_use,
          prob = 1)
## [1] 0.7
## Let's make a guess with the probability of high use as 0
loss_func(class = alc$high_use,
          prob = 0)
## [1] 0.3
## Let's use the probabilities of high use as predicted by our model
loss_func(class = alc$high_use,
          prob = alc$prediction)
## [1] 0.2324324

As we can see, our model has the lowest prediction error compared to the naive guesses.

10-fold cross validation with the latter model

Additionally, we can perform cross-validation on the model to test its robustness.

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486486

As we can see, with cross-validation our model has a smaller prediction error than the model introduced in the DataCamp exercise (0.2486 < 0.26).
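
Under the hood, 10-fold cross-validation simply refits the model on 9/10 of the data and scores the held-out tenth, averaging the errors. A minimal manual sketch on simulated data (illustrative only, not the alc data):

```r
## Illustration only: manual 10-fold cross-validation on simulated data
set.seed(1)
n  <- 100
df <- data.frame(x = rnorm(n))
df$y <- rbinom(n, 1, plogis(df$x))

K     <- 10
folds <- sample(rep(1:K, length.out = n))  # random fold assignment

errs <- sapply(1:K, function(k) {
  fit  <- glm(y ~ x, data = df[folds != k, ], family = "binomial")
  prob <- predict(fit, newdata = df[folds == k, ], type = "response")
  mean((prob > 0.5) != df$y[folds == k])   # misclassification in held-out fold
})
mean(errs)  # average prediction error over the folds
```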

date()
## [1] "Mon Dec 13 03:41:30 2021"
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Finnish_Finland.1252  LC_CTYPE=Finnish_Finland.1252   
## [3] LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C                    
## [5] LC_TIME=Finnish_Finland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] boot_1.3-28     gridExtra_2.3   broom_0.7.10    forcats_0.5.1  
##  [5] stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_2.0.2    
##  [9] tidyr_1.1.4     tibble_3.1.5    tidyverse_1.3.1 ggpubr_0.4.0   
## [13] GGally_2.1.2    ggplot2_3.3.5  
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2         sass_0.4.0         jsonlite_1.7.2     carData_3.0-4     
##  [5] modelr_0.1.8       bslib_0.3.1        assertthat_0.2.1   highr_0.9         
##  [9] cellranger_1.1.0   yaml_2.2.1         pillar_1.6.4       backports_1.3.0   
## [13] glue_1.4.2         digest_0.6.28      RColorBrewer_1.1-2 ggsignif_0.6.3    
## [17] rvest_1.0.2        colorspace_2.0-2   cowplot_1.1.1      htmltools_0.5.2   
## [21] plyr_1.8.6         pkgconfig_2.0.3    haven_2.4.3        scales_1.1.1      
## [25] openxlsx_4.2.4     rio_0.5.27         tzdb_0.1.2         generics_0.1.1    
## [29] farver_2.1.0       car_3.0-11         ellipsis_0.3.2     withr_2.4.2       
## [33] cli_3.1.0          magrittr_2.0.1     crayon_1.4.2       readxl_1.3.1      
## [37] evaluate_0.14      fs_1.5.0           fansi_0.5.0        MASS_7.3-54       
## [41] rstatix_0.7.0      xml2_1.3.2         foreign_0.8-81     tools_4.1.1       
## [45] data.table_1.14.2  hms_1.1.1          lifecycle_1.0.1    munsell_0.5.0     
## [49] reprex_2.0.1       zip_2.2.0          compiler_4.1.1     jquerylib_0.1.4   
## [53] rlang_0.4.12       grid_4.1.1         rstudioapi_0.13    labeling_0.4.2    
## [57] rmarkdown_2.11     gtable_0.3.0       abind_1.4-5        DBI_1.1.1         
## [61] reshape_0.8.8      curl_4.3.2         R6_2.5.1           lubridate_1.8.0   
## [65] knitr_1.36         fastmap_1.1.0      utf8_1.2.2         stringi_1.7.5     
## [69] Rcpp_1.0.7         vctrs_0.3.8        dbplyr_2.1.1       tidyselect_1.1.1  
## [73] xfun_0.27

Chapter 4 - Clustering and classification

Loading packages for easier data wrangling and plotting

For the following steps, which include plotting, we are going to use functions from the following packages:

library(GGally) ## plotting
library(ggpubr) ## package which contains ggplot2 + useful tools 
library(tidyverse) ## wrangling
library(MASS) ## for the data and analysis

General features of the dataset

Dataset & variables

In this session we will be exploring the Boston dataset, which is part of the MASS data analysis R package. The dataset contains information on housing values in the suburbs of Boston and was collected for a publication on the methodological problems that come with using housing market data to measure the willingness to pay for clean air*.

*Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

## load the Boston dataset to get a clean version
data("Boston")

## check the contents
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset has 14 variables, 12 of which seem to be continuous and 2 integer. It has 506 observations. Let's also have a look at the summary of the variables, as this will be important to note later.

## summary of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

As we can see, all of the values are positive and the variables have varying means and scales. For example, nox ranges from 0.385 to 0.871 while tax ranges from 187 to 711.
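The differing scales are easy to quantify directly. As a quick base-R sketch (not part of the exercise itself), we can compare the range widths of the two variables mentioned above:

```r
library(MASS)

data("Boston")

## range width (max - min) of each variable highlights the differing scales
ranges <- sapply(Boston, function(x) diff(range(x)))
ranges[c("nox", "tax")]  ## nox spans ~0.49 units while tax spans 524
```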

Graphical overview of the data

Let's have a look at the data to see if there are any trends visible between the variables.

## Shamelessly borrowing code from here: https://stackoverflow.com/questions/57450913/removing-background-color-for-column-labels-while-keeping-plot-background-color

## this function creates correlation plots (colour scale for correlation and numeric representation) that can be used with ggpairs 
cor_func <- function(data, mapping, method, symbol, ...){
  
  # get mapping information from aes
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # calculate correlation
  corr <- cor(x, y, method=method, use='complete.obs')
  # set colour scale that is used for all correlations
  colFn <- colorRampPalette(c("brown1", "white", "dodgerblue"), 
                            interpolate ='spline')
  
  # get the index of the colour for the correlation observed with the variable
  fill <- colFn(100)[findInterval(corr, seq(-1, 1, length = 100))]
  
  # generate text plot with rounded correlation
  ggally_text(
    label = paste(symbol, as.character(round(corr, 2))), 
    mapping = aes(),
    xP = 0.5, yP = 0.5,
    color = 'black',
    ...
  ) +
    # plot the background panel using the correlation colour in fill
    theme(panel.background = element_rect(fill = fill))
}

## Plot pairs plots with ggpairs
ggpairs(Boston, 
        upper = list(continuous = wrap(cor_func, method = 'spearman', symbol = expression('\u03C1 ='))),
        diag = list(continuous = function(data, mapping, ...) {ggally_densityDiag(data = data, mapping = mapping) + 
                  theme(panel.background = element_blank())})) + 
  theme(strip.background = element_rect(fill = "white"),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank())

nox and crim seem to have the highest positive correlation of the variable pairs, while nox and dis seem to have the strongest negative correlation. The distributions of the values of the variable pairs vary greatly.

Setting up the data set for data analysis

Standardizing variables in the dataset

If we want to perform clustering and related analyses we should standardize the data. We can use stats::scale() for this. With default settings it standardizes each variable (creates a Z-score) by subtracting the mean and dividing by the standard deviation.

# center and standardize variables and transform the matrix into a dataframe
boston_scaled <- scale(Boston) %>% as.data.frame()

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

As we can see, the mean value for each variable after standardization is now zero, compared to the original summary which had varying means. Due to centering we also have negative values, which the original dataset did not have.
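As a small sanity check (a minimal sketch, not part of the exercise itself), we can verify that scale() with default settings really computes (x - mean(x)) / sd(x) per column:

```r
library(MASS)

data("Boston")

## manual Z-score of one variable
manual_z <- (Boston$nox - mean(Boston$nox)) / sd(Boston$nox)

## scale() does the same per column; [, 1] drops the matrix attributes
auto_z <- scale(Boston$nox)[, 1]

all.equal(manual_z, auto_z)  ## TRUE
```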

Creating test and training datasets

Next we will generate a categorical variable out of the crime rate by quantiles (“low”, “med_low”, “med_high”, “high”) and replace the crim variable with it.

# create quantile breaks of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels=c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim) %>% 
# add the new categorical value to scaled data
  dplyr::mutate(crime=crime)
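By construction, cutting at the quantiles produces four roughly equal-sized classes. A small toy check (simulated data, not the exercise data):

```r
set.seed(1)
x <- rnorm(1000)

## quantile() defaults to the 0/25/50/75/100% breaks used above
classes <- cut(x, breaks = quantile(x), include.lowest = TRUE,
               labels = c("low", "med_low", "med_high", "high"))

table(classes)  ## 250 observations per class
```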

Now we set our train and test datasets by subsetting observations from the scaled Boston dataset. We will use 80% of the observations for training and 20% for testing.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# set seed for reproducibility
set.seed(123)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

#save the correct classes from test data
correct_classes<- test$crime

#remove the crime variable from test data
test<- dplyr::select(test, -crime)

Performing linear discriminant analysis on the data

Linear discriminant analysis

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.01866313 -0.9066422 -0.08120770 -0.8835724  0.38666334 -0.9213895
## med_low  -0.07141687 -0.3429155  0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591  0.2162741  0.19544522  0.3637157  0.12480815  0.4564260
## high     -0.48724019  1.0149946 -0.03371693  1.0481437 -0.47733231  0.8274496
##                 dis        rad        tax    ptratio      black        lstat
## low       0.9094324 -0.6819317 -0.7458486 -0.4234280  0.3729083 -0.766883775
## med_low   0.3694282 -0.5395408 -0.4205644 -0.1079710  0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959  0.1049654  0.009720124
## high     -0.8601246  1.6596029  1.5294129  0.8057784 -0.6383964  0.900379309
##                 medv
## low       0.47009410
## med_low   0.01548761
## med_high  0.19440747
## high     -0.71294711
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## chas     0.002460776  0.03963849  0.1699312
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113

As we can see from the summary of the model, the proportions of the crime rate categories are quite equal in our training dataset. The percentage separations achieved by the discriminant functions, however, are quite different: LD1 captures 95.23% of the differences between the groups, while LD2 and LD3 add 3.64% and 1.13% respectively.
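The "proportion of trace" values can be recomputed from the singular values stored in the fitted object. A self-contained illustration with the built-in iris data (our crime-rate model's lda.fit object carries the same svd component):

```r
library(MASS)

## fit an LDA on a standard dataset
fit <- lda(Species ~ ., data = iris)

## proportion of trace = squared singular values over their sum
round(fit$svd^2 / sum(fit$svd^2), 4)  ## LD1 dominates heavily here too
```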

We can now plot LDA (bi)plot to see differences among the groups.

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 1.5)

As we can see from the arrows in the (bi)plot, the variable rad shows the most discrimination (longest arrow), while the other factors have smaller influences. It also seems to be one of the major factors predictive of a high crime rate.

Assessing predictiveness of the model

Let's use our test dataset now to check how accurately our model can predict the crime rate.

# generate predictions using the test dataset
lda.pred <- predict(lda.fit, newdata=test)

#  cross tabulate the results with the crime categories
table(correct= correct_classes, predicted= lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        4    0
##   med_low    2      17        6    0
##   med_high   1       9       13    2
##   high       0       0        1   27
# get by column proportions of classifications to compare predictiveness of the model across the groups 
table(correct= correct_classes, predicted= lda.pred$class) %>% prop.table(margin = 2)
##           predicted
## correct           low    med_low   med_high       high
##   low      0.76923077 0.27777778 0.16666667 0.00000000
##   med_low  0.15384615 0.47222222 0.25000000 0.00000000
##   med_high 0.07692308 0.25000000 0.54166667 0.06896552
##   high     0.00000000 0.00000000 0.04166667 0.93103448

As we can see from the tabulation, the model predicts low and high crime rates quite accurately (77% and 93% correct predictions respectively) but is not as good at predicting medium-low and medium-high crime rates (47% and 54% correct predictions respectively).
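From the same cross-tabulation we can also compute an overall accuracy. The counts below are copied from the confusion table above:

```r
## confusion matrix: rows = correct class, columns = predicted class
tab <- matrix(c(10, 10,  4,  0,
                 2, 17,  6,  0,
                 1,  9, 13,  2,
                 0,  0,  1, 27),
              nrow = 4, byrow = TRUE)

## overall accuracy = correctly classified / all observations
sum(diag(tab)) / sum(tab)  ## ~0.66
```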

Distance analysis of the variables

Performing Distance analysis

Let's perform distance analysis for the observations in the dataset.

As something that is totally extra to the exercise, we will also plot a heatmap with the previously used crime rate quantiles as annotation, so we can see if the distances are smaller between observations belonging to the same crime rate categories!
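For reference, dist() computes Euclidean distances by default. A minimal check against the textbook formula:

```r
## two points: (0, 0) and (3, 4), a classic 3-4-5 triangle
m <- rbind(c(0, 0), c(3, 4))

## dist() and the manual formula agree
as.numeric(dist(m))             ## 5
sqrt(sum((m[1, ] - m[2, ])^2))  ## 5
```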

# reload boston
data("Boston")

# Z-score variables
boston_scaled <- scale(Boston)

# compute the euclidean distance matrix
dist.b <- dist(boston_scaled)

# create quantile breaks of crim
bins <- quantile(boston_scaled[,"crim"])

# create a character vector of colours based on the crime rate  
crime <- cut(boston_scaled[,"crim"], breaks = bins, include.lowest = TRUE, labels=c("darkblue", "lightblue", "rosybrown", "brown")) %>% as.character()

## lets also plot the distances as a heatmap for fun 
library(gplots, quietly = T)
## Warning: package 'gplots' was built under R version 4.1.2
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
heatmap.2(as.matrix(dist.b), trace="none", 
          labRow = FALSE, labCol = FALSE,
          ColSideColors=crime,
          RowSideColors=crime)

As we can see from the heatmap high crime rate categories cluster together quite nicely (darker red colour -> R’s “brown”) and are quite distinct from the bulk of low and medium low groups which are also quite similar to each other (blue colours).

Performing and optimizing K-means clustering

Based on the distance heatmap above we could try starting off the K-means clustering using 3-4 centers. We’ll use 3 for starters.

# Calculate clusters using K means
km <-kmeans(boston_scaled, centers = 3)

# Print the centers
km$centers
##         crim         zn      indus       chas        nox         rm        age
## 1  0.7847946 -0.4872402  1.1450405 -0.2723291  1.0384727 -0.4896488  0.8062002
## 2 -0.3882449  0.2731699 -0.6264383 -0.2723291 -0.5823006  0.2188304 -0.4585819
## 3 -0.2048299 -0.1564737  0.2306535  3.6647712  0.3342374  0.3344149  0.3170678
##          dis         rad        tax    ptratio      black      lstat       medv
## 1 -0.8383961  1.12436056  1.2034416  0.6329916 -0.6397959  0.9277624 -0.7457491
## 2  0.4807157 -0.58641200 -0.6161585 -0.2814183  0.3151747 -0.4640135  0.3182241
## 3 -0.3634565 -0.02700292 -0.1304164 -0.4453253  0.1787986 -0.1976385  0.6422884

Our clusters have somewhat distinct means, but we could optimize our clustering by, deviating from the DataCamp exercise, using a gap-statistic-based goodness-of-clustering measure, which is more closely explained here. The gap statistic is more thoroughly explained here. Simply put, the gap statistic can be used to choose the number of clusters k where the biggest jump in within-cluster distance has occurred, based on the overall behavior of uniformly drawn samples*.

One of the main reasons I chose to use the gap statistic instead of the within-cluster sum of squares (WCSS), which was used in the DataCamp exercise, is that we can decide k (the number of clusters) without relying on the interpretation of a graph, relying instead on the outcome of an equation. In this case we will use the 1-SE rule, which looks for the smallest k such that its value f(k) is not more than 1 standard error away from the first local maximum.
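To make the 1-SE rule concrete, here is a toy example with made-up gap values (cluster::maxSE() implements the rule applied below):

```r
library(cluster)

## made-up gap curve peaking at k = 3, with a constant standard error
gap <- c(1.0, 2.9, 3.0, 2.5)
se  <- rep(0.2, 4)

## 1-SE rule: smallest k whose gap is within one SE of the first local maximum
maxSE(gap, se, method = "firstSEmax")  ## picks k = 2 rather than the peak k = 3
```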

Let’s perform the analysis with cluster::clusGap() and plot the gap statistic with ggplot2.

*Towards Data Science blog post on K-mean and gap-statistic, 28.11.2021

library(cluster)
# perform gap-statistic analysis on the dataset, running 1000 bootstraps
# set seed for bootstrapping reproducibility
set.seed(123)
t.km <- clusGap(boston_scaled, kmeans, 20, B = 1000, iter.max=50) 

# output the number of clusters by 1-SE rule 
k.n <- maxSE(t.km$Tab[,3], t.km$Tab[,4], method = "firstSEmax")

# create a dataframe for plotting
gs.df = data.frame(t.km$Tab, k=1:nrow(t.km$Tab))

# plot with ggplot
ggplot(gs.df, aes(k, gap)) + 
  geom_line() + 
  geom_vline(xintercept = k.n) +
  geom_point() + 
  geom_errorbar(aes(ymax=gap+SE.sim, ymin=gap-SE.sim)) + 
  theme_bw(12)

The analysis defines the optimal number of clusters as 10.

Visualizing the results

We will now plot a pairs plot to see how these clusters are represented among variable pairs.

# Calculate clusters using K means with optimal n
km.res <-kmeans(boston_scaled, centers = k.n)

# create vector of clusters
clusters <- km.res$cluster

# generate a variable for cluster
to.plot <- boston_scaled %>% as.data.frame() %>% mutate(cluster=paste("cl", clusters, sep = "_"))

# take other variables than cluster for plotting 
v.to.plot <- colnames(to.plot)[colnames(to.plot)!="cluster"]

# plot pairs plot
ggpairs(to.plot,
        # set variables to plot
        columns = v.to.plot, 
        # set colour mapping to clusters; annoyingly, alpha also plots to the legend and I did not quickly figure out how to remove it without causing errors...
        aes(colour=cluster, alpha=0.8), 
        # plot legend in first slot
        legend = 1,
        # change the upper plot to points as well
        upper=list(continuous = "points", combo = "facethist", discrete = "facetbar", na ="na")) + 
  # move legend to bottom
  theme(legend.position = "bottom") 

The plot is indeed quite colorful with 10 clusters, but there are still noticeable things here. When we look at the density plots in the middle (while they are quite hard to read…), we can see that the clusters separate the observations quite nicely within the value distributions of the different variables. For example, with zn and indus the separation of the “cl_8” and “cl_9” clusters differs. In the paired variable point plots we can see, for example, for black the separation of “cl_1” and “cl_3” from the others, and for zn the separation of “cl_6”. This visualization might not be the best but will suffice this time…

date()
## [1] "Mon Dec 13 03:43:14 2021"

Chapter 5 - Dimensionality reduction techniques

Loading packages for easier data wrangling and plotting

For the following steps, which include plotting, we are going to use some functions from the following packages:

library(GGally) ## plotting
library(ggpubr) ## package which contains ggplot2 + useful tools
library(ggfortify) ## for easier PCA plotting
library(tidyverse) ## wrangling
library(FactoMineR) ## analysis
library(factoextra) ## visualization tools for clustering etc.

General features of the dataset

Overview of the variables

The dataset for this week’s exercise originates from the United Nations Development Programme.

# load data
human <- read.delim("Data/human.tsv", header = T, sep = "\t")
# show structure 
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ eduF2M   : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labF2M   : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ edExp    : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ lifeExp  : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNIpc    : num  65 42.3 56.4 44 45.4 ...
##  $ matMort  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ birthRate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parlRep  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# show summary
summary(human)
##      eduF2M           labF2M           edExp          lifeExp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##      GNIpc            matMort         birthRate         parlRep     
##  Min.   :  1.123   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  5.275   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 13.054   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 46.496   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 27.256   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :908.000   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The dataset has 155 observations and 8 variables. All of the variables are numeric, one of them more specifically integer. The scales of the values vary, which is due to some of them being ratios of ratios (e.g. eduF2M), some percentages (parlRep), etc.

Here are the explanations for these variables:

  • GNIpc = Gross National Income per capita
  • lifeExp = Life expectancy at birth
  • edExp = Expected years of schooling
  • matMort = Maternal mortality ratio
  • birthRate = Adolescent birth rate
  • parlRep = Percentage of female representatives in parliament
  • edu2F = Proportion of females with at least secondary education
  • edu2M = Proportion of males with at least secondary education
  • labF = Proportion of females in the labour force
  • labM = Proportion of males in the labour force
  • eduF2M = edu2F/edu2M
  • labF2M = labF/labM

Relationships between variables

Now let’s plot the already classic pairs plot to delve into the relationships between the variables.

# this function creates correlation plots (colour scale for correlation and numeric representation) that can be used with ggpairs 
cor_func <- function(data, mapping, method, symbol, ...){
  
  # get mapping information from aes
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # calculate correlation
  corr <- cor(x, y, method=method, use='complete.obs')
  # set colour scale that is used for all correlations
  colFn <- colorRampPalette(c("brown1", "white", "dodgerblue"), 
                            interpolate ='spline')
  
  # get the index of the colour for the correlation observed with the variable
  fill <- colFn(100)[findInterval(corr, seq(-1, 1, length = 100))]
  
  # generate text plot with rounded correlation
  ggally_text(
    label = paste(symbol, as.character(round(corr, 2))), 
    mapping = aes(),
    xP = 0.5, yP = 0.5,
    color = 'black',
    ...
  ) +
    # plot the background panel using the correlation colour in fill
    theme(panel.background = element_rect(fill = fill)) + 
  theme(strip.background = element_rect(fill = "white"),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) 
}
# draw pairs plot
ggpairs(human,upper = list(continuous = wrap(cor_func, method = 'spearman', symbol = expression('\u03C1 ='))))
Figure 1. Overview of the variables in the human dataset

As we can see from Figure 1, the variables have varying distributions among observations. For example, Gross National Income per capita (GNIpc) is mostly concentrated near zero, with six outlier(ish) observations quite separate from the rest (>500). Maternal mortality ratio (matMort) and life expectancy at birth (lifeExp) have the strongest negative correlation, while maternal mortality ratio and adolescent birth rate (birthRate) have the strongest positive correlation among the variable pairs.

Principal component analysis

Performing PCA on unstandardized and standardized data

Next we will perform principal component analysis (PCA) on both the unstandardized and the standardized dataset to see how standardizing affects the interpretation of PCA results. It should be noted that it is generally recommended to normalize data before PCA.
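A quick toy demonstration (simulated data, not the human dataset) of why this matters: a variable on a much larger scale dominates the unscaled PCA.

```r
## two unit-scale variables plus one on a far larger scale
set.seed(42)
d <- data.frame(a = rnorm(100), b = rnorm(100), big = rnorm(100, sd = 1000))

## without scaling, PC1 is essentially just 'big'
unscaled <- prcomp(d, scale. = FALSE)
abs(unscaled$rotation["big", "PC1"])  ## close to 1

## with scaling, the loadings become comparable in magnitude
scaled <- prcomp(d, scale. = TRUE)
```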

We’ll do both of the plots in a single code block for convenience and add some titles to the plots alongside the requested captioning with fig.cap in markdown.

## PCA without scaling
pca.results <- prcomp(human, scale. = F)

## plot with ggfortifys autoplot (more info here: https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html)
p1 <- autoplot(pca.results,
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3) +
  theme_bw(14) +
  ggtitle("Unscaled PCA")

## PCA with scaling
pca.results.sc <- prcomp(human, scale. = T)

## plot with ggfortifys autoplot
p2 <- autoplot(pca.results.sc,
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3) +
  theme_bw(14) +
  ggtitle("Scaled PCA")

## combine plots
ggarrange(p1, p2, align = "hv", labels = c("A", "B"))
Figure 2. A) PCA on unstandardized data. Initial PCA analysis with unstandardized data hints that maternal mortality (matMort) could be one of the key segregating factors between the countries. B) PCA on standardized data. With standardization we can now more reliably see that there are several variables, such as life expectancy (lifeExp) and adolescent birth rate (birthRate), that explain the divergence between countries.

Interpreting the outcomes of PCA

When we compare the unscaled (Figure 2A) and scaled data (Figure 2B) we can straight away see the importance of standardizing/scaling/normalization. With unscaled data, Gross National Income per capita (GNIpc) and maternal mortality ratio (matMort), which both had heavily right-skewed distributions (Figure 1), are the major drivers of explained variance in the dataset. This observation is due to the differences in the scales of the values. For example, with the variable labF (proportion of females in the labor force) a unit change of 0.1 to 0.9 could clearly be viewed as a significant change, but with GNIpc the same numeric change, e.g. 300.1 to 300.9, is obviously of less impact, and vice versa. When we scale the data we make the variables more comparable and can thus observe the effects better. In Figure 2B we can see that when the scale of the values is accounted for, there are several more variables explaining the observed variance. Life expectancy (lifeExp), maternal mortality ratio (matMort) and adolescent birth rate (birthRate) seem to be fairly connected with PC1, which explains a lot more variance than PC2 (49.85% vs 16.43% respectively).
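The variance percentages shown on the plot axes come from the squared standard deviations of the components. A self-contained example with the built-in USArrests data:

```r
## PCA on standardized data
pca <- prcomp(USArrests, scale. = TRUE)

## share of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_explained, 2)  ## PC1 explains roughly 62% here
```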

Multiple Correspondence Analysis on the tea data

Selecting variables for and performing MCA

For the Multiple Correspondence Analysis (MCA) step we will use the tea dataset from the FactoMineR R package. Some information about the data from here:

The data used here concern a questionnaire on tea. We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).

Let’s do some summarization of the data using some basic commands.

# load tea
data("tea")
# show structure 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# show summary
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

The dataset has 300 observations of 36 variables. 35 of these are categorical variables stored as factors, and one (age) is an integer variable.

When performing MCA it is good practice to leave out variables whose factor levels have next to no observations. Let's plot some bar plots and select 10 variables to use in the analysis.

# Draw summary in the form of bar plots
gather(tea) %>% ggplot(aes(value)) + 
  geom_bar() + 
  facet_wrap("key", scales = "free") + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) ,axis.title = element_blank()) ## tidy up
## Warning: attributes are not identical across measure variables;
## they will be dropped
**Figure 3.** Bar plots of the counts of the different factor levels across the variables in `tea`.

Let's select some variables whose distributions are not heavily skewed in the bar plots of Figure 3: age_Q, breakfast, diuretic, escape.exoticism, exciting, feminine, frequency, SPC, tea.time and sugar.

We'll perform MCA and then plot the results using the handy ggplot wrappers from the factoextra R package. The plots that we are going to use are a scree plot (an overview of the inertia retained by each dimension) and a biplot with squared cosine (cos2) coloring of the variables. cos2 measures the degree of association between variable categories and a particular axis.

## selecting 10 variables that have reasonable counts across the categories
vars2mca <- c("age_Q", "breakfast", "diuretic", "escape.exoticism", "exciting", "feminine", "frequency", "SPC", "tea.time", "sugar")

## perform MCA on the selected variables
res.mca <- MCA(tea[,vars2mca], graph = F)

## using factoextra visualization functions generate some plot from MCA:
## first a visualization of the percentages of variance explained by the different dimensions
p1 <- fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 15))
## Secondly, a biplot with individuals (without labels) and variables.
p2 <- fviz_mca_biplot(res.mca, 
                      label = "var",
                      repel = TRUE,
                      ggtheme = theme_minimal(),
                      col.var = "cos2",
                      gradient.cols = c("blue", "red"))
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## combine the plots with labeling
ggarrange(p1, p2, labels = c("A", "B"), widths=c(1,1.8), ncol = 2)
**Figure 4. A)** Percentages of inertia explained by the MCA dimensions. **B)** MCA biplot of individuals and variable categories.

Interpreting the outcomes of MCA

The selected variables explain only a small proportion of the variance among the observations in each dimension. From the scree plot (Figure 4A) we can see that even the first dimension explains only roughly 10% of the observed variation. From the biplot (Figure 4B) we can see that the student category of SPC and the +60 category of age_Q both have a high quality of representation (cos2) on Dim1 and Dim2.
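To back up this reading with numbers, the exact inertia percentages and cos2 values behind Figure 4 can be pulled out of the MCA object. A minimal sketch, assuming `res.mca` from the chunk above and the factoextra helper functions:

```r
library(factoextra)

# Eigenvalues and the percentage of inertia explained by each dimension
# (the numbers behind the scree plot in Figure 4A)
get_eigenvalue(res.mca)

# cos2 (quality of representation) of each variable category on each
# dimension -- the quantity used for coloring the biplot in Figure 4B
head(get_mca_var(res.mca)$cos2)
```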

date()
## [1] "Mon Dec 13 03:43:21 2021"

Chapter 6 - Analysis of longitudinal data

Setting up the contents for the exercise

Loading packages for easier data wrangling and plotting

For the following steps which include plotting, we are going to use some functions in following packages:

library(GGally)
library(ggpubr)
library(tidyverse)
library(lme4)

Reading in the two data sets

First the BPRS dataset, which, as described in the Multivariate Analysis for the Behavioral Sciences textbook*, contains data on 40 male subjects from two treatment groups who were assessed with the brief psychiatric rating scale (BPRS) weekly for 8 weeks.

# reading in 
BPRS <- read.table("Data/BPRS.tsv", header = T) %>%
  mutate_at(1:2, as.factor)  
# view structure
str(BPRS)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...

The second dataset for this exercise is RATS, which contains weight measurements gathered from rats on three different diets over a 9-week period. Measurements were mainly taken on a weekly basis*.

# reading in 
RATS <- read.table("Data/RATS.tsv", header = T) %>%
  mutate_at(1:2, as.factor)
# view structure
str(RATS)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

*Vehkalahti, Kimmo & Everitt, Brian S. (2019). Multivariate Analysis for the Behavioral Sciences, Second Edition. Chapman and Hall/CRC, Boca Raton, Florida, USA.

Graphical displays and summary measure approach using RATS data

General features of the data

Let's start by plotting the longitudinal data with and without standardization.

# Draw the plot
p1 <- ggplot(RATS, aes(x = Time, y = Weight, linetype = ID, color=Group)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:16, times=3)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS$Weight), max(RATS$Weight))) + 
  theme_bw(14) +
  ylab("Weight (g)") +
  theme(legend.position = "none")

RATS.scaled <- RATS %>% 
  group_by(Time) %>% 
  mutate(std.weight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()

p2 <- ggplot(RATS.scaled, aes(x = Time, y = std.weight, linetype = ID, color=Group)) +
  geom_line() +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS.scaled$std.weight), max(RATS.scaled$std.weight))) + 
  theme_bw(14) +
  ylab("Standardized weight") +
  theme(legend.position = "none")

ggarrange(p1,p2, labels = "AUTO")
**Figure 1: A)** Weights of the individual rats over time, separated by group. **B)** Standardized weights by group.

Based on Figure 1A we could boldly assume that the rats are gaining weight on the diets (especially groups 2 & 3), though the amount of weight gain seems to differ between the diets. From Figure 1B we can see that while there are some instances where "tracking" seems evident, there are also many cases against it. Also, group 2 seems to contain one rat that is considerably larger than its peers!

Summary measures for RATS

Let's delve deeper into the data. Making observations from plots of individuals is often misleading, so we'll summarize the data a bit.

# Summary data with mean and standard error of Weight by Group and Time
RATS.SE <- RATS %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(16)) %>% # NB: 16 is the total number of rats; the group sizes actually differ
  ungroup()

# Plot the mean profiles
p1 <- ggplot(RATS.SE, aes(x = Time, y = mean, colour=Group)) +
  geom_line() +
  geom_point(size=3) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=0.3) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)") + 
  theme_bw()

# Create summary data by Group and ID with mean as the summary variable (ignoring the baseline, Time 1).
RATS.scaled.gr <- RATS %>%
  dplyr::filter(Time>1) %>%
  group_by(Group, ID) %>%
  summarise(scaled.mean=mean(Weight)) %>%
  ungroup()

p2 <- ggplot(RATS.scaled.gr, aes(x = Group, y = scaled.mean, fill=Group)) +
  geom_boxplot() +
  facet_grid(.~Group, scales = "free_x", labeller = label_both) + 
  stat_summary(fun = "mean", geom = "point", shape=23, size=4) +
  scale_y_continuous(name = "mean(Weight), days 1-64") +
  theme_bw(12) + 
  theme(legend.position = "none")

# Combine plots
ggarrange(p1,p2, labels = "AUTO", widths = c(2,1), nrow = 1)
**Figure 2: A)** Group means with standard errors across the time points. **B)** Mean weight of each rat across the time points, by group.

From Figure 2A we can see that the standard error in group 2 is far greater than in groups 1 & 3, probably due to the large individual highlighted in Figure 1. Using the summary measure approach we can see outlier-ish values in all of the groups. However, since it is generally advisable to leave outliers alone unless they can be attributed to measurement error etc., we'll remove only the most deviating one (in group 2), and even that just for practice.
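As a quick sanity check (a sketch, not part of the original exercise), we can confirm which rat drives the large standard error in group 2 by ranking the per-rat mean weights:

```r
library(dplyr)

# Rank the rats by their mean weight over the whole study;
# the top row should be the heavy group-2 individual seen in Figure 1
RATS %>%
  group_by(Group, ID) %>%
  summarise(mean_weight = mean(Weight), .groups = "drop") %>%
  arrange(desc(mean_weight)) %>%
  head(3)
```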

Testing for differences using summary measure approach

Now we can perform some tests on the data. First we'll remove the "outlier" and then continue by performing a one-way ANOVA across the three groups and pairwise t-tests for the group comparisons.

# Create a new data by filtering the outlier 
RATS.f <- RATS %>% 
  dplyr::filter(Weight<550) %>% ## remove the one "outlier"
  filter(Time > 1 ) %>% # remove baseline
  group_by(Group, WD) %>%
  summarise( mean = mean(Weight) )%>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
comps <- list( c("1", "2"), c("1", "3"), c("2", "3") )

ggplot(RATS.f, aes(x = Group, y = mean, fill=Group)) +
  geom_boxplot() +
  stat_compare_means(method = "t.test", comparisons = comps) +
  stat_compare_means(label.y = 615, method = "anova") + 
  scale_y_continuous(name = "mean(Weight), days 1-64") +
  theme_bw(12) + 
  theme(legend.position = "none")
**Figure 3:** Summary measure approach. P-values are from t-tests unless annotated otherwise.

It seems that the groups differ significantly, both overall (ANOVA) and in every pairwise comparison.
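The same pairwise comparisons can also be run directly with base R's `pairwise.t.test()`, which additionally supports multiple-testing correction. A sketch, assuming the filtered summary data `RATS.f` computed above:

```r
# Pairwise Welch t-tests between the diet groups on the per-day mean
# weights; Holm-adjusted p-values are stricter than the unadjusted
# ones annotated in Figure 3
pairwise.t.test(RATS.f$mean, RATS.f$Group,
                pool.sd = FALSE, p.adjust.method = "holm")
```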

Linear mixed effects models for BPRS data

A glimpse into the data set

Let's start by taking a quick look at the data, misleadingly not with glimpse() (terrible pun(?)) but by plotting the BPRS scores of the different subjects and treatments.

# Plot the BPRS data
ggplot(BPRS, aes(x = week, y = bprs, colour = treatment, linetype=subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  scale_x_continuous(name = "Week") +
  scale_y_continuous(name = "BPRS") +
  theme_pubr()
**Figure 4:** Plot of the BPRS time series data.

From Figure 4 we can see that there is a general decline in the BPRS during the treatment period, but from this plot it is quite hard to tell whether there are differences between the groups.

Fitting linear model

To gain a better understanding of how treatment and time relate to BPRS, let's fit a linear model using week and treatment as explanatory variables for bprs.

# create a regression model bprs_reg
bprs_reg <- lm(bprs ~ week + treatment,BPRS)

# print out a summary of the model
summary(bprs_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

From the model we can see that week is a (really highly) significant variable and contributes to a lower BPRS (estimate ~ -2.3); thus, based on this model, which assumes independence of the BPRS measurements, treatment time decreases BPRS. Based on the R-squared, the model explains only 18.5% of the variability. We can quite safely assume that the measurements are not independent of one another, so let's keep on modelling.
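One way to eyeball that non-independence is to reshape the data to wide format and look at the correlations between the weekly measurements; large positive correlations mean the independence assumption of the plain linear model is violated. A sketch, assuming the BPRS data frame loaded above:

```r
library(dplyr)
library(tidyr)

# One row per subject-treatment combination, one column per week
BPRS_wide <- BPRS %>%
  select(treatment, subject, week, bprs) %>%
  pivot_wider(names_from = week, values_from = bprs, names_prefix = "week")

# Correlations between repeated measurements on the same subjects
round(cor(select(BPRS_wide, starts_with("week"))), 2)
```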

Fitting a random intercept model

A better model to study the effects could be a random intercept model, which allows the linear regression fit for each subject to differ in intercept from the other subjects. Let's fit the model.

# Create a random intercept model
bprs_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)

# Print the summary of the model
summary(bprs_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

This model also suggests week (time) to be an important variable.

Fitting a random slope model

Now let's move on to a random slope model, which helps us to analyse the BPRS profiles over time. Using this together with the random intercept, we can account for individual differences in the BPRS score as a function of time.

# create a random intercept and random slope model
bprs_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)

# print a summary of the model
summary(bprs_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

There are some differences compared to the random intercept model, and week is again negatively associated with BPRS.

Let's perform an ANOVA (likelihood ratio test) to see if one fit is better than the other.

# perform an ANOVA test on the two models
anova(bprs_ref1, bprs_ref)
## Data: BPRS
## Models:
## bprs_ref: bprs ~ week + treatment + (1 | subject)
## bprs_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## bprs_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## bprs_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is small (0.026), so the random slope model provides a better fit for the BPRS data. However, the AIC values are very similar.
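The chi-square statistic in the anova() table is simply the difference between the two deviances, and its p-value comes from the chi-square distribution with degrees of freedom equal to the difference in parameter counts (7 - 5 = 2). As a sanity check:

```r
# Likelihood ratio test by hand: difference of the two deviances
chisq <- 2738.7 - 2731.4          # ~7.3, matching the Chisq column (7.2721)

# Upper-tail probability of a chi-square variable with 2 df
pchisq(chisq, df = 2, lower.tail = FALSE)
# ~0.026, matching the Pr(>Chisq) reported by anova()
```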

Plotting the slope and intercept model

Let's also add a week × treatment interaction to the random intercept and slope model and test whether we can achieve better results with it.

# create a random intercept and random slope model with the interaction
bprs_ref2 <-  lmer(bprs ~ week * treatment + (week | subject), data = BPRS, REML = FALSE)
# perform an ANOVA test on the two models
anova(bprs_ref2, bprs_ref1)
## Data: BPRS
## Models:
## bprs_ref1: bprs ~ week + treatment + (week | subject)
## bprs_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## bprs_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## bprs_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

the new and improved model is not that improved. It’s scraping the edge of significance with the P-value but ultimately falls sort which means it is not and upgrade. However, lets still use it to plot some fitted values.

# draw the plot of BPRS with the observed bprs values
p1 <- ggplot(BPRS, aes(x = week, y = bprs, colour=treatment)) +
  geom_line(aes(linetype = subject)) +
  scale_linetype_manual(values = rep(1:20, times=2)) +
  scale_x_continuous(name = "Weeks") +
  scale_y_continuous(name = "BRPS") +
  theme(legend.position = "top") +
  theme_bw(12)

# Create a vector of the fitted values
Fitted <- fitted(bprs_ref2)

# Create a new column fitted in BPRS
BPRS$fitted <- Fitted

# draw the plot of BPRS with the fitted values of bprs
p2 <- ggplot(BPRS, aes(x = week, y = fitted, colour=treatment)) +
  geom_line(aes(linetype = subject)) +
  scale_linetype_manual(values = rep(1:20, times=2)) +
  scale_x_continuous(name = "Weeks") +
  scale_y_continuous(name = "BRPS") +
  theme(legend.position = "top") +
  theme_bw(12)

# Combine plots
ggarrange(p1,p2, labels = "AUTO", widths = c(1,1), nrow = 1)
**Figure 5: A)** Observed values. **B)** Fitted values.

When we compare Figures 5A & B we can see that the trends we estimated by eye held true in this case. The fitted values indeed follow the observed data and show that across the treatment weeks both treatments decrease BPRS (or neither affects it in any way?).